Life Ladder: 설문 질문 “10이 누릴수 있는 최고의 삶이고, 0이 누릴수 있는 최저의 삶이라면, 당신의 현재 삶은 몇점인가?”의 나라별 평균 점수 출처: Gallup World Poll
GDP (국내총생산): 2011년 국제 달러 가격으로 계산한 나라별 일년동안 생산활동 출처: World Development Indicators(WDI) (20170915)
Healthy Life Expectancy (건강 기대 수명): 사람의 연령별 사망률, 이환률, 기능적 건강 상태를 고려한 좋은 건강 수준에서 살아갈 수 있을 것으로 기대되는 나라별 평균 나이 출처: World Health Organization (WHO), WDI, 여러 논문의 통계를 통해서, Global Happiness Council에서 계산
Social Support: “필요하면 언제든 힘든 상황일때, 기댈수 있는 친척이나 친구분들이 있나요?” 에 관한 답변 (0 또는 1)에 국가별 평균 출처: Gallup World Poll
Freedom to make life choices: “지금 현재 본인의 삶의 선택의 자유성에 관해서 만족하나요 불만족하나요?” 에 관한 답변 (0 또는 1)에 국가별 평균 출처: Gallup World Poll
Generosity: “지난 한달 동안 자선 단체에 기부하신적이 있나요?”에 관한 국가별 평균 답변을, GDP per capita에 회귀한 잔차 출처: Gallup World Poll, World Happiness Report
Corruption Perception: “정부에 부패가 널리 퍼져있나요?”과 “사업내 비리가 널리 퍼져있나요?”에 관한 답변 (0 또는 1)의 국가별 평균 출처: Gallup World Poll
Positive Affect: “어저께 행복한 감정을 많이 느끼셨나요?”과 “어저께 웃거나 미소를 많이 지으셨나요?”과 “어저께 만족감을 많이 느끼셨나요?” 에 관한 답변 (0 또는 1)의 국가별 평균 출처: Gallup World Poll
Negative Affect: “어저께 걱정을 많이 하셨나요?”과 “어저께 슬픈 감정들을 많이 느끼셨나요?”과 “어저께 화를 많이 느끼셨나요?”의 답변 (각 0 또는 1)의 국가별 평균 출처: Gallup World Poll
Migrant Acceptance Index (거주 이민 수용 수치): 다음 문장에 대해서 개인적으로 좋게 생각하나요 나쁘게 생각하나요? “이민들이 이 나라에서 거주하다” “이민들이 나의 이웃이 되다” “이민들의 가까운 친척과 결혼을 하다” “좋은것이다”는 3점, “상황마다 틀리다” 또는 “잘 모르겠다”는 1 점, “나쁜것이다”는 0 점. 수치는 3가지 답변에 총 점수이다. 세계 평균 MAI는 5.29이다 출처: Gallup World Poll
Democratic Quality WGI에서 다양한 원천 데이터를 통해 만든 Voice and Accountability (VA) 과 Political Stability and Absence of Violence/Terrorism (PV)라는 종합수치의 단순 평균. u = 0, sd = 1, -2.5 에서 2.5의 수치
VA: 국가의 시민들이 정부를 선택 할수있는 자유, 표현의 자유, 단체 결사의 자유, 밑 언론의 자유를 통합한 수치
PV: 정부가 헌법위반이나 폭력적인 수단 (정치적 폭력또는 테러)으로 인해서 와해될 가능성 출처: World Governance Indicator, Global Happiness Council
Delivery Quality WGI에서 다양한 원천 데이터를 통해 만든 Government Effectiveness(GE), Regulatory Quality (RQ), Rule of Law (RL), Control of Corruption (CC)라는 종합수치의 단순 평균. u = 0, sd = 1, -2.5 에서 2.5의 수치
GE: 공공서비스, 정부 업무, 정치적 압박으로부터 독립성, 정책형성, 정책집행, 정책에 관한 정부의 확약에 관한 신뢰성
RQ: 민간부문 발전을 위한 정부의 능력에 관한 인식
RL: 사회규정, 계약 시행, 재산권리, 경찰, 법정들의 관한 신뢰성
CC: 개인 이익을 위한 공권력 사용 또는 정부 부패 출처: World Governance Indicator, Global Happiness Council
# suppressing warning, the warnings have been notified by our team, but was not a problem in the on overall analysis
options(warn=-1)
# importing libraries to local memory
library(pacman)
pacman::p_load(readr,dplyr,gvlma,data.table,car,ggplot2,lm.beta,
corrplot,dplyr,MASS,Hmisc,r2glmm,olsrr,DAAG,highcharter)
# Import data
df <- data.table::fread(paste0('https://raw.githubusercontent.com/Jinwooooo/',
'fcdc-overflow-project-1/master/data/raw-data.csv'))
df$V1 <- NULL
# Remove white space in column names
reset.col <- c('country','year','life.ladder','log.gdp.per.cap','social.support',
'healthy.life.expectancy.at.birth','freedom.to.make.life.choices','generosity',
'perception.of.corruption','positive.affect','negative.affect','confidence.in.govt',
'democratic.quality','delivery.quality','sd.of.ladder.by.country',
'sd.mean.ladder.by.country','gini.index','gini.index.past.mean',
'gini.house.hold.income')
colnames(df) <- reset.col
# Removing unused columns and filling na values with mean values (in order to actually use values as predict at the end)
df.main <- df %>%
dplyr::select(-c(sd.of.ladder.by.country, sd.mean.ladder.by.country)) %>%
dplyr::group_by(year) %>%
dplyr::mutate_at(vars(-country),funs(replace(., which(is.nan(.)), mean(., na.rm = TRUE)))) %>%
dplyr::mutate_at(vars(-country), funs(replace(., which(is.na(.)), mean(., na.rm = TRUE))))
df.main <- df.main %>%
dplyr::group_by(country) %>%
dplyr::mutate_all(funs(replace(., which(is.nan(.)), mean(., na.rm = TRUE)))) %>%
dplyr::mutate_all(funs(replace(., which(is.na(.)), mean(., na.rm = TRUE))))
df.main$gini.house.hold.income <- Hmisc::impute(df.main$gini.house.hold.income,mean)
# [debug] : checking for na values
# colSums(is.na(df.main))
# Box-whisker plot to see life ladder's mean and IQR values per year
ggplot2::ggplot(df.main, aes(x = year, y = life.ladder, group = year)) + ggplot2::geom_boxplot()
ggplot2::ggplot(df.main, aes(x = year, y = log.gdp.per.cap, group = year)) + ggplot2::geom_boxplot()
# to see if it's somewhere near normal distribution
ggplot2::ggplot(df.main, aes(life.ladder)) +
ggplot2::geom_histogram(aes(y = ..density..),
bins = 20,
col = 4,
fill = "blue",
alpha = 0.4) +
ggplot2::geom_density(col = 6) +
ggplot2::labs(x = "life.ladder", y = "count")
# Split into train and test data
df.train <- df.main %>% dplyr::filter(year >= 2005 & year <= 2014)
df.test <- df.main %>% dplyr::filter(year > 2014 & year < 2017)
# Reduce each set to means
df.train.mean <- df.train %>%
dplyr::group_by(country) %>%
dplyr::summarise_all(funs(mean(., na.rm = TRUE))) %>%
dplyr::select(-country)
df.test.mean <- df.test %>%
dplyr::group_by(country) %>%
dplyr::summarise_all(funs(mean(., na.rm = TRUE))) %>%
dplyr::select(-country)
df.train.map <- df.train %>%
dplyr::group_by(country) %>%
dplyr::summarise_all(funs(mean(., na.rm = TRUE))) %>%
dplyr::select(-year)
# data is from highcharter map library, the grouping for continents were off, but wasn't a problem in overall EDA
af.data <- list(region='Africa',
country=(get_data_from_map(download_map_data('custom/africa')))$name)
asia.data <- list(region='Asia',
country=(get_data_from_map(download_map_data('custom/asia')))$name)
eu.data <- list(region='Europe',
country=(get_data_from_map(download_map_data('custom/europe')))$name)
na.data <- list(region='North America',
country=(get_data_from_map(download_map_data('custom/north-america')))$name)
oc.data <- list(region='Oceania',
country=(get_data_from_map(download_map_data('custom/oceania')))$name)
sa.data <- list(region='South America',
country=(get_data_from_map(download_map_data('custom/south-america')))$name)
world.data <- Reduce(function(x, y) merge(x, y, all=TRUE),
list(af.data, asia.data, eu.data, na.data, oc.data, sa.data))
# values that are different were brute force searched and then substituted
# missing.countries <- (df.train.map[!(df.train.map$country %in% map.data$name),])[,1]
# excluded.countries <- (map.data[!(map.data$name %in% df.train.map$country),])$name
df.train.map[df.train.map$country == "Congo (Brazzaville)",1] = 'Democratic Republic of the Congo'
df.train.map[df.train.map$country == "Congo (Kinshasa)",1] = 'Republic of Congo'
df.train.map[df.train.map$country == "North Cyprus",1] = 'Northern Cyprus'
df.train.map[df.train.map$country == "Serbia",1] = 'Republic of Serbia'
df.train.map[df.train.map$country == "Somaliland region",1] = 'Somaliland'
df.train.map[df.train.map$country == "Taiwan Province of China",1] = 'Taiwan'
df.train.map[df.train.map$country == "Tanzania",1] = 'United Republic of Tanzania'
df.train.map[df.train.map$country == "United States",1] = 'United States of America'
df.final <- merge(df.train.map, world.data, by='country', all.x=TRUE)
df.final$region <- as.character(df.final$region)
region.happiness <- c()
for(i in unique(df.final$region)) {
summary <- df.final[df.final$region == i,]
region.happiness <- c(region.happiness,round(mean(summary$life.ladder,na.rm=TRUE),digits=3))
}
df.outer <- (data_frame(name = unique(df.final$region),
y = round(region.happiness,digits=3),
drilldown=tolower(unique(df.final$region))))[1:6,]
df.outer <- df.outer[order(df.outer$y, decreasing=TRUE),]
hc <- highchart() %>%
hc_chart(type='column') %>%
hc_title(text='Drilldown by Life Ladder for Regions') %>%
hc_xAxis(type='category') %>%
hc_legend(enabled=FALSE) %>%
hc_plotOptions(series=list(borderWidth=0,dataLabels=list(enabled=TRUE))) %>%
hc_add_series(data=df.outer, name="Life Ladder",colorByPoint=TRUE)
africa <- df.final[df.final$region == 'Africa',][c('country','life.ladder')]
colnames(africa) <- c('name','value')
africa <- africa[!is.na(africa$name),]
africa <- africa[order(africa$value,decreasing=TRUE),]
africa$value <- round(africa$value, digits=3)
asia <- df.final[df.final$region == 'Asia',][c('country','life.ladder')]
colnames(asia) <- c('name','value')
asia <- asia[!is.na(asia$name),]
asia <- asia[order(asia$value,decreasing=TRUE),]
asia$value <- round(asia$value, digits=3)
europe <- df.final[df.final$region == 'Europe',][c('country','life.ladder')]
colnames(europe) <- c('name','value')
europe <- europe[!is.na(europe$name),]
europe <- europe[order(europe$value,decreasing=TRUE),]
europe$value <- round(europe$value, digits=3)
north.america <- df.final[df.final$region == 'North America',][c('country','life.ladder')]
colnames(north.america) <- c('name','value')
north.america <- north.america[!is.na(north.america$name),]
north.america <- north.america[order(north.america$value,decreasing=TRUE),]
north.america$value <- round(north.america$value, digits=3)
oceania <- df.final[df.final$region == 'Oceania',][c('country','life.ladder')]
colnames(oceania) <- c('name','value')
oceania <- oceania[!is.na(oceania$name),]
oceania <- oceania[order(oceania$value,decreasing=TRUE),]
oceania$value <- round(oceania$value, digits=3)
south.america <- df.final[df.final$region == 'South America',][c('country','life.ladder')]
colnames(south.america) <- c('name','value')
south.america <- south.america[!is.na(south.america$name),]
south.america <- south.america[order(south.america$value,decreasing=TRUE),]
south.america$value <- round(south.america$value, digits=3)
hc %>%
hc_drilldown(allowPointDrilldown = TRUE,
series = list(
list(id = 'africa', data = list_parse2(africa)),
list(id = 'asia', data = list_parse2(asia)),
list(id = 'europe', data = list_parse2(europe)),
list(id = 'north america', data = list_parse2(north.america)),
list(id = 'oceania', data = list_parse2(oceania)),
list(id = 'south america', data = list_parse2(south.america))
)
)
# due to too many maps, other variables have been exlcuded from this report
# for other map visualization on other variables that's not life.ladder, see map-visualization.r file
map.life.ladder <- hcmap("custom/world-robinson-highres", data = df.train.map, value = "life.ladder",
joinBy = c('name','country'), name = 'Life Ladder',
dataLabels = list(enabled = TRUE, format = '{point.name}'),
borderColor = "#000000", borderWidth = 0.1,
tooltip = list(valueDecimals = 3),
nullColor = '#FF0000') %>%
hc_title(text = 'Life Ladder World Map')
map.life.ladder
full model: 모든 변수를 다 포함한 모형. 이중 통계적으로 유의미하지 않은 변수를 빼준다.
# Initial Linear Model
lm.train.mean <- lm(life.ladder ~. -year, data = df.train.mean)
summary(lm.train.mean)
##
## Call:
## lm(formula = life.ladder ~ . - year, data = df.train.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.30833 -0.24882 0.00648 0.26352 0.99401
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.664166 0.781652 -0.850 0.39686
## log.gdp.per.cap 0.301018 0.065862 4.570 1.01e-05 ***
## social.support 1.601548 0.512657 3.124 0.00214 **
## healthy.life.expectancy.at.birth 0.007102 0.009833 0.722 0.47128
## freedom.to.make.life.choices 0.743791 0.435332 1.709 0.08961 .
## generosity 0.165156 0.310855 0.531 0.59600
## perception.of.corruption -0.981273 0.308673 -3.179 0.00180 **
## positive.affect 3.440644 0.689705 4.989 1.67e-06 ***
## negative.affect 1.148848 0.656743 1.749 0.08230 .
## confidence.in.govt -0.856688 0.309801 -2.765 0.00641 **
## democratic.quality -0.032719 0.096206 -0.340 0.73427
## delivery.quality 0.092323 0.110959 0.832 0.40671
## gini.index 1.050375 1.353571 0.776 0.43898
## gini.index.past.mean -1.185610 0.985106 -1.204 0.23068
## gini.house.hold.income -1.012306 0.497685 -2.034 0.04372 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4521 on 149 degrees of freedom
## Multiple R-squared: 0.8411, Adjusted R-squared: 0.8261
## F-statistic: 56.33 on 14 and 149 DF, p-value: < 2.2e-16
gvlma::gvlma(lm.train.mean)
##
## Call:
## lm(formula = life.ladder ~ . - year, data = df.train.mean)
##
## Coefficients:
## (Intercept) log.gdp.per.cap
## -0.664166 0.301018
## social.support healthy.life.expectancy.at.birth
## 1.601548 0.007102
## freedom.to.make.life.choices generosity
## 0.743791 0.165156
## perception.of.corruption positive.affect
## -0.981273 3.440644
## negative.affect confidence.in.govt
## 1.148848 -0.856688
## democratic.quality delivery.quality
## -0.032719 0.092323
## gini.index gini.index.past.mean
## 1.050375 -1.185610
## gini.house.hold.income
## -1.012306
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma::gvlma(x = lm.train.mean)
##
## Value p-value Decision
## Global Stat 25.9816 3.192e-05 Assumptions NOT satisfied!
## Skewness 2.1188 1.455e-01 Assumptions acceptable.
## Kurtosis 0.5836 4.449e-01 Assumptions acceptable.
## Link Function 23.0574 1.572e-06 Assumptions NOT satisfied!
## Heteroscedasticity 0.2219 6.376e-01 Assumptions acceptable.
# Selecting independent variable that have relation to the dependent variable (stepwise)
step(lm.train.mean, df.train.mean, direction = "both")
## Start: AIC=-246.15
## life.ladder ~ (year + log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth +
## freedom.to.make.life.choices + generosity + perception.of.corruption +
## positive.affect + negative.affect + confidence.in.govt +
## democratic.quality + delivery.quality + gini.index + gini.index.past.mean +
## gini.house.hold.income) - year
##
## Df Sum of Sq RSS AIC
## - democratic.quality 1 0.0236 30.472 -248.02
## - generosity 1 0.0577 30.506 -247.84
## - healthy.life.expectancy.at.birth 1 0.1066 30.555 -247.57
## - gini.index 1 0.1231 30.572 -247.49
## - delivery.quality 1 0.1415 30.590 -247.39
## - gini.index.past.mean 1 0.2960 30.745 -246.56
## <none> 30.449 -246.15
## - freedom.to.make.life.choices 1 0.5965 31.045 -244.96
## - negative.affect 1 0.6253 31.074 -244.81
## - gini.house.hold.income 1 0.8455 31.294 -243.66
## - confidence.in.govt 1 1.5627 32.011 -239.94
## - social.support 1 1.9944 32.443 -237.74
## - perception.of.corruption 1 2.0652 32.514 -237.38
## - log.gdp.per.cap 1 4.2687 34.718 -226.63
## - positive.affect 1 5.0855 35.534 -222.82
##
## Step: AIC=-248.02
## life.ladder ~ log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth +
## freedom.to.make.life.choices + generosity + perception.of.corruption +
## positive.affect + negative.affect + confidence.in.govt +
## delivery.quality + gini.index + gini.index.past.mean + gini.house.hold.income
##
## Df Sum of Sq RSS AIC
## - generosity 1 0.0684 30.541 -249.65
## - healthy.life.expectancy.at.birth 1 0.1109 30.583 -249.42
## - gini.index 1 0.1363 30.609 -249.29
## - delivery.quality 1 0.1693 30.642 -249.11
## - gini.index.past.mean 1 0.3183 30.791 -248.31
## <none> 30.472 -248.02
## - freedom.to.make.life.choices 1 0.5755 31.048 -246.95
## - negative.affect 1 0.6252 31.098 -246.69
## - gini.house.hold.income 1 0.8242 31.297 -245.64
## - confidence.in.govt 1 1.5403 32.013 -241.93
## - social.support 1 1.9713 32.444 -239.74
## - perception.of.corruption 1 2.3266 32.799 -237.95
## - log.gdp.per.cap 1 4.6013 35.074 -226.96
## - positive.affect 1 5.0620 35.534 -224.82
##
## Step: AIC=-249.65
## life.ladder ~ log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth +
## freedom.to.make.life.choices + perception.of.corruption +
## positive.affect + negative.affect + confidence.in.govt +
## delivery.quality + gini.index + gini.index.past.mean + gini.house.hold.income
##
## Df Sum of Sq RSS AIC
## - healthy.life.expectancy.at.birth 1 0.1125 30.653 -251.05
## - gini.index 1 0.1429 30.684 -250.89
## - delivery.quality 1 0.2013 30.742 -250.57
## <none> 30.541 -249.65
## - gini.index.past.mean 1 0.4190 30.960 -249.42
## - freedom.to.make.life.choices 1 0.5840 31.125 -248.54
## - negative.affect 1 0.7559 31.297 -247.64
## - gini.house.hold.income 1 0.7833 31.324 -247.50
## - confidence.in.govt 1 1.4847 32.026 -243.87
## - social.support 1 1.9574 32.498 -241.46
## - perception.of.corruption 1 2.4197 32.960 -239.15
## - log.gdp.per.cap 1 4.6060 35.147 -228.61
## - positive.affect 1 6.8724 37.413 -218.37
##
## Step: AIC=-251.05
## life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices +
## perception.of.corruption + positive.affect + negative.affect +
## confidence.in.govt + delivery.quality + gini.index + gini.index.past.mean +
## gini.house.hold.income
##
## Df Sum of Sq RSS AIC
## - gini.index 1 0.2970 30.950 -251.47
## <none> 30.653 -251.05
## - delivery.quality 1 0.3847 31.038 -251.00
## - freedom.to.make.life.choices 1 0.5679 31.221 -250.04
## - gini.index.past.mean 1 0.7254 31.379 -249.21
## - negative.affect 1 0.8567 31.510 -248.53
## - gini.house.hold.income 1 1.0249 31.678 -247.66
## - confidence.in.govt 1 1.5553 32.209 -244.93
## - social.support 1 2.0424 32.696 -242.47
## - perception.of.corruption 1 2.4603 33.114 -240.39
## - log.gdp.per.cap 1 6.7900 37.443 -220.23
## - positive.affect 1 7.1994 37.853 -218.45
##
## Step: AIC=-251.47
## life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices +
## perception.of.corruption + positive.affect + negative.affect +
## confidence.in.govt + delivery.quality + gini.index.past.mean +
## gini.house.hold.income
##
## Df Sum of Sq RSS AIC
## - delivery.quality 1 0.2975 31.248 -251.90
## <none> 30.950 -251.47
## - gini.index.past.mean 1 0.4405 31.391 -251.15
## - freedom.to.make.life.choices 1 0.5084 31.459 -250.79
## - negative.affect 1 1.1664 32.117 -247.40
## - gini.house.hold.income 1 1.4421 32.392 -246.00
## - confidence.in.govt 1 1.5854 32.536 -245.27
## - social.support 1 1.9594 32.910 -243.40
## - perception.of.corruption 1 2.6472 33.598 -240.01
## - log.gdp.per.cap 1 6.7913 37.742 -220.93
## - positive.affect 1 8.4208 39.371 -214.00
##
## Step: AIC=-251.9
## life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices +
## perception.of.corruption + positive.affect + negative.affect +
## confidence.in.govt + gini.index.past.mean + gini.house.hold.income
##
## Df Sum of Sq RSS AIC
## <none> 31.248 -251.90
## - gini.index.past.mean 1 0.5679 31.816 -250.94
## - freedom.to.make.life.choices 1 0.7175 31.965 -250.18
## - negative.affect 1 1.0945 32.342 -248.25
## - gini.house.hold.income 1 1.3335 32.581 -247.04
## - confidence.in.govt 1 1.8491 33.097 -244.47
## - social.support 1 2.1096 33.357 -243.18
## - perception.of.corruption 1 3.2637 34.512 -237.61
## - positive.affect 1 8.4022 39.650 -214.84
## - log.gdp.per.cap 1 10.6645 41.912 -205.74
##
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support +
## freedom.to.make.life.choices + perception.of.corruption +
## positive.affect + negative.affect + confidence.in.govt +
## gini.index.past.mean + gini.house.hold.income, data = df.train.mean)
##
## Coefficients:
## (Intercept) log.gdp.per.cap
## -0.5445 0.3502
## social.support freedom.to.make.life.choices
## 1.6200 0.7872
## perception.of.corruption positive.affect
## -1.1493 3.8305
## negative.affect confidence.in.govt
## 1.4255 -0.9072
## gini.index.past.mean gini.house.hold.income
## -0.9835 -1.1608
second model: AIC를 고려한 후의 회귀모형
# Build new model
lm.train.step <- lm(life.ladder ~ log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth +
freedom.to.make.life.choices + perception.of.corruption +
positive.affect + negative.affect + confidence.in.govt +
gini.index.past.mean + gini.house.hold.income, data = df.train.mean)
summary(lm.train.step)
##
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support +
## healthy.life.expectancy.at.birth + freedom.to.make.life.choices +
## perception.of.corruption + positive.affect + negative.affect +
## confidence.in.govt + gini.index.past.mean + gini.house.hold.income,
## data = df.train.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.28765 -0.22980 0.03101 0.25587 1.10491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.892713 0.696774 -1.281 0.202059
## log.gdp.per.cap 0.300777 0.058718 5.122 8.96e-07 ***
## social.support 1.565495 0.501918 3.119 0.002169 **
## healthy.life.expectancy.at.birth 0.012622 0.008594 1.469 0.143976
## freedom.to.make.life.choices 0.782478 0.417068 1.876 0.062541 .
## perception.of.corruption -1.089667 0.288366 -3.779 0.000225 ***
## positive.affect 3.655918 0.604839 6.044 1.10e-08 ***
## negative.affect 1.252073 0.622795 2.010 0.046145 *
## confidence.in.govt -0.853102 0.301644 -2.828 0.005309 **
## gini.index.past.mean -0.789266 0.600414 -1.315 0.190633
## gini.house.hold.income -0.944341 0.474583 -1.990 0.048391 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4488 on 153 degrees of freedom
## Multiple R-squared: 0.8392, Adjusted R-squared: 0.8287
## F-statistic: 79.84 on 10 and 153 DF, p-value: < 2.2e-16
gvlma::gvlma(lm.train.step)
##
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support +
## healthy.life.expectancy.at.birth + freedom.to.make.life.choices +
## perception.of.corruption + positive.affect + negative.affect +
## confidence.in.govt + gini.index.past.mean + gini.house.hold.income,
## data = df.train.mean)
##
## Coefficients:
## (Intercept) log.gdp.per.cap
## -0.89271 0.30078
## social.support healthy.life.expectancy.at.birth
## 1.56550 0.01262
## freedom.to.make.life.choices perception.of.corruption
## 0.78248 -1.08967
## positive.affect negative.affect
## 3.65592 1.25207
## confidence.in.govt gini.index.past.mean
## -0.85310 -0.78927
## gini.house.hold.income
## -0.94434
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma::gvlma(x = lm.train.step)
##
## Value p-value Decision
## Global Stat 24.7157 5.739e-05 Assumptions NOT satisfied!
## Skewness 1.3049 2.533e-01 Assumptions acceptable.
## Kurtosis 0.5750 4.483e-01 Assumptions acceptable.
## Link Function 22.3620 2.258e-06 Assumptions NOT satisfied!
## Heteroscedasticity 0.4737 4.913e-01 Assumptions acceptable.
# Removing variables with little impact (healthy.life, gini.index.past.mean)
lm.train.step2 <- lm(life.ladder ~ log.gdp.per.cap + social.support +
freedom.to.make.life.choices + perception.of.corruption +
positive.affect + negative.affect + confidence.in.govt + gini.house.hold.income,
data = df.train.mean)
summary(lm.train.step2)
##
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support +
## freedom.to.make.life.choices + perception.of.corruption +
## positive.affect + negative.affect + confidence.in.govt +
## gini.house.hold.income, data = df.train.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.24193 -0.25445 0.02696 0.28740 1.05503
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.58285 0.66106 -0.882 0.379308
## log.gdp.per.cap 0.36480 0.04778 7.634 2.17e-12 ***
## social.support 1.58478 0.50488 3.139 0.002031 **
## freedom.to.make.life.choices 0.87281 0.41788 2.089 0.038374 *
## perception.of.corruption -1.23015 0.28409 -4.330 2.66e-05 ***
## positive.affect 3.45735 0.55509 6.229 4.23e-09 ***
## negative.affect 1.28225 0.61130 2.098 0.037566 *
## confidence.in.govt -0.94773 0.30126 -3.146 0.001986 **
## gini.house.hold.income -1.44787 0.42146 -3.435 0.000759 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4531 on 155 degrees of freedom
## Multiple R-squared: 0.8339, Adjusted R-squared: 0.8254
## F-statistic: 97.3 on 8 and 155 DF, p-value: < 2.2e-16
gvlma(lm.train.step2)
##
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support +
## freedom.to.make.life.choices + perception.of.corruption +
## positive.affect + negative.affect + confidence.in.govt +
## gini.house.hold.income, data = df.train.mean)
##
## Coefficients:
## (Intercept) log.gdp.per.cap
## -0.5828 0.3648
## social.support freedom.to.make.life.choices
## 1.5848 0.8728
## perception.of.corruption positive.affect
## -1.2302 3.4574
## negative.affect confidence.in.govt
## 1.2822 -0.9477
## gini.house.hold.income
## -1.4479
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = lm.train.step2)
##
## Value p-value Decision
## Global Stat 27.125424 1.875e-05 Assumptions NOT satisfied!
## Skewness 1.132482 2.872e-01 Assumptions acceptable.
## Kurtosis 0.005501 9.409e-01 Assumptions acceptable.
## Link Function 24.886980 6.079e-07 Assumptions NOT satisfied!
## Heteroscedasticity 1.100461 2.942e-01 Assumptions acceptable.
# Removing healthy.life.expectany.at.birth, gini.index.past.mean, generosity columns from dataset
df.train.mean2 <- df.train.mean %>%
dplyr::select(-c(healthy.life.expectancy.at.birth, gini.index.past.mean, generosity))
# PQL (Penalized Quasi-Likelihood) <<< this is for one of the models from time series
# Random Intercept (random = ~1|year)
glm.train.mean2 <- MASS::glmmPQL(life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices +
perception.of.corruption + positive.affect + negative.affect +
confidence.in.govt + gini.house.hold.income, ~1 | year, family = Gamma,
data = df.test.mean, verbose = FALSE)
summary(glm.train.mean2)
## Linear mixed-effects model fit by maximum likelihood
## Data: df.test.mean
## AIC BIC logLik
## NA NA NA
##
## Random effects:
## Formula: ~1 | year
## (Intercept) Residual
## StdDev: 2.669923e-07 0.09230614
##
## Variance function:
## Structure: fixed weights
## Formula: ~invwt
## Fixed effects: life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices + perception.of.corruption + positive.affect + negative.affect + confidence.in.govt + gini.house.hold.income
## Value Std.Error DF t-value p-value
## (Intercept) 0.4684654 0.029136930 137 16.078063 0.0000
## log.gdp.per.cap -0.0145448 0.001996504 137 -7.285157 0.0000
## social.support -0.0975307 0.022178397 137 -4.397555 0.0000
## freedom.to.make.life.choices -0.0637479 0.019253580 137 -3.310966 0.0012
## perception.of.corruption 0.0125634 0.009933901 137 1.264701 0.2081
## positive.affect -0.0679222 0.021237152 137 -3.198273 0.0017
## negative.affect -0.0365868 0.024105853 137 -1.517756 0.1314
## confidence.in.govt 0.0326451 0.010574072 137 3.087282 0.0024
## gini.house.hold.income 0.0419657 0.017488778 137 2.399581 0.0178
## Correlation:
## (Intr) lg.g.. scl.sp fr.... prcp.. pstv.f
## log.gdp.per.cap -0.477
## social.support -0.321 -0.438
## freedom.to.make.life.choices 0.034 -0.200 -0.143
## perception.of.corruption -0.566 0.384 -0.072 0.181
## positive.affect -0.209 0.074 -0.207 -0.487 -0.037
## negative.affect -0.328 -0.143 0.355 -0.045 -0.250 0.196
## confidence.in.govt -0.423 0.287 0.166 -0.388 0.379 0.092
## gini.house.hold.income -0.382 0.187 0.253 -0.137 0.127 -0.150
## ngtv.f cnfd..
## log.gdp.per.cap
## social.support
## freedom.to.make.life.choices
## perception.of.corruption
## positive.affect
## negative.affect
## confidence.in.govt 0.060
## gini.house.hold.income -0.219 0.012
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.7297312 -0.5848387 -0.0577856 0.5345243 2.8768158
##
## Number of Observations: 148
## Number of Groups: 3
r2glmm::r2beta(glm.train.mean2)
## Effect Rsq upper.CL lower.CL
## 1 Model 0.796 0.841 0.751
## 2 log.gdp.per.cap 0.249 0.366 0.143
## 3 social.support 0.125 0.233 0.044
## 4 freedom.to.make.life.choices 0.066 0.159 0.010
## 6 positive.affect 0.059 0.150 0.008
## 8 confidence.in.govt 0.052 0.140 0.005
## 9 gini.house.hold.income 0.033 0.111 0.001
## 7 negative.affect 0.015 0.077 0.000
## 5 perception.of.corruption 0.008 0.061 0.000
anova(lm.train.mean, lm.train.step, lm.train.step2)
## Analysis of Variance Table
##
## Model 1: life.ladder ~ (year + log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth +
## freedom.to.make.life.choices + generosity + perception.of.corruption +
## positive.affect + negative.affect + confidence.in.govt +
## democratic.quality + delivery.quality + gini.index + gini.index.past.mean +
## gini.house.hold.income) - year
## Model 2: life.ladder ~ log.gdp.per.cap + social.support + healthy.life.expectancy.at.birth +
## freedom.to.make.life.choices + perception.of.corruption +
## positive.affect + negative.affect + confidence.in.govt +
## gini.index.past.mean + gini.house.hold.income
## Model 3: life.ladder ~ log.gdp.per.cap + social.support + freedom.to.make.life.choices +
## perception.of.corruption + positive.affect + negative.affect +
## confidence.in.govt + gini.house.hold.income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 149 30.449
## 2 153 30.813 -4 -0.36465 0.4461 0.77511
## 3 155 31.816 -2 -1.00227 2.4523 0.08956 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# testing if there are outliers that are influencing the R^2 value
car::outlierTest(lm.train.step2, data = df.train.mean)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 137 -2.896202 0.0043272 0.70965
par(mfrow = c(2, 2))
plot(lm.train.step2)
par(mfrow = c(1,1))
# Conclusion : Sri Lanka is an outlier, but not omitted, due to being 1 value out of many
# Correlation plot (visually checking for multicollinearity)
lm.corr <- cor(df.train.mean2[3:ncol(df.train.mean2)])
corrplot::corrplot(lm.corr, method = "number")
1단계 : 회귀모형은 타당한가? 귀무가설 : 회귀모형은 타당하지 않다. 대립가설 : 회귀모형은 타당하다. F-statistic: 98.83 on 8 and 153 DF, p-value: < 2.2e-16 결론 : 유의확률이 0.000이므로 유의수준 0.05에서 회귀모형은 통계적으로 타당하다.
2단계 : 독립변수는 종속변수에게 영향을 주는가? (조건 : 1단계의 결론이 대립가설이어야 함. 이 회귀모형의 경우 대립가설임.) 귀무가설 : 독립변수는 종속변수에게 영향을 주지 않는다(beta1 = 0). 대립가설 : 독립변수는 종속변수에게 영향을 준다(beta1 not equal 0).
Estimate Std. Error t value Pr(>|t|)
log.gdp.per.cap 0.37435 0.04702 7.961 3.59e-13 ***
social.support 1.40026 0.52229 2.681 0.008145 **
freedom.to.make.life.choices 0.98169 0.45543 2.156 0.032684 *
perception.of.corruption -1.20868 0.28250 -4.278 3.30e-05 ***
positive.affect 3.44207 0.55740 6.175 5.69e-09 ***
negative.affect 1.17269 0.60926 1.925 0.056113 .
confidence.in.govt -0.99422 0.30707 -3.238 0.001477 **
gini.house.hold.income -1.54129 0.41282 -3.734 0.000266 ***
결론 : 유의확률이 0.000이므로 유의수준 0.05에서 모형에 포함된 독립변수들은 종속변수에게 모두 통계적으로 유의한 영향을 주는 것으로 나타났다. 즉 log.gdp.per.cap, freedom.to.make.life.choices, generosity, perception.of.corruption, confidence.in.govt는 모두 life ladder (행복지수) 에 통계적으로 유의한 영향을 준다.
** 3단계 : 독립변수는 어떠한 영향을 주는가? ** (조건 : 2단계의 결론이 대립가설이어야 함. 이 회귀모형의 경우대립가설임.)
Estimate
log.gdp.per.cap 0.37435
social.support 1.40026
freedom.to.make.life.choices 0.98169
perception.of.corruption -1.20868
positive.affect 3.44207
negative.affect 1.17269
confidence.in.govt -0.99422
gini.house.hold.income -1.54129
(다른 독립변수가 모두 고정되어 있을때!!!) 1) 독립변수인 log.gdp.per.cap 의 기본단위가 1 증가하면, 종속변수인 ladder는 약 0.37435 정도 증가한다. 2) 독립변수인 social.support가 1 증가하면, 종속변수인 ladder는 약 1.40026 증가한다. 3) 독립변수인 freedom.to.make.life.choices가 1 증가하면, 종속변수인 ladder는 약 0.98169 증가한다. 4) 독립변수인 perception.of.corruption가 1 증가하면, 종속변수인 ladder는 약 1.20868 감소한다. 5) 독립변수인 positive.affect가 1 증가하면, 종속변수인 ladder는 약 3.44207 증가한다. 6) 독립변수인 negative.affect가 1 증가하면, 종속변수인 ladder는 약 1.17269 증가한다. 6) 독립변수인 confidence.in.govt 가 1 증가하면, 종속변수인 ladder는 약 0.99422 감소한다. 7) 독립변수인 gini.house.hold.income가 1 증가하면, 종속변수인 ladder는 약 1.54129 감소한다.
** 4단계 : 회귀모형의 설명력 = 독립변수의 설명력 ** Multiple R-squared: 0.8379, R^2 = SSR / SST : 결정계수(Coefficient of Determination) = 0.8379 회귀모형의 설명력은 약 83.79% 또는 독립변수들의 설명력은 약 83.79% 종속변수의 다름이 100%라고 하면 독립변수가 종속변수의 다름을 약 83.79% 설명하고 있다(독립변수 때문에 종속변수의 다름이 약 83.79% 발생했다)
** 세부분석 ** # 다중공선성(Multicolinearity) : 독립변수들 간의 상관관계를 파악함 VIF(Variation Inflation Factor) = 분산팽창인자 = 분산팽창요인 VIF 값이 10 이상이면 독립변수들 간에 다중공선성(선형관계)이 존재한다고 파악 -> 회귀분석 결과 해석 못 함.
car::vif(lm.train.step2)
## log.gdp.per.cap social.support
## 2.638470 2.724857
## freedom.to.make.life.choices perception.of.corruption
## 2.588469 1.758110
## positive.affect negative.affect
## 2.420524 1.476022
## confidence.in.govt gini.house.hold.income
## 1.670271 1.417159
3개의 VIF 값이 10 이상인 값이 하나도 없으므로 독립변수들 간에 다중공선성은 존재하지 않는다.
영향력이 가장 큰 독립변수는? 독립변수의 단위가 모두 같다면, 회귀계수의 절대값이 가장 큰 것이 가장 영향력이 크다. 독립변수의 단위가 다르면, 표준화된 회귀계수의 절대값이 가장 큰 것이 가장 영향력이 크다. lm.beta::lm.beta(lm.result) 를 사용해서 표준화된 회귀계수 값을 구한다. lm.beta는 표준화된 회귀계수를 구해준다.
lm.beta::lm.beta(lm.train.step2)
##
## Call:
## lm(formula = life.ladder ~ log.gdp.per.cap + social.support +
## freedom.to.make.life.choices + perception.of.corruption +
## positive.affect + negative.affect + confidence.in.govt +
## gini.house.hold.income, data = df.train.mean)
##
## Standardized Coefficients::
## (Intercept) log.gdp.per.cap
## 0.00000000 0.40589351
## social.support freedom.to.make.life.choices
## 0.16959329 0.10998771
## perception.of.corruption positive.affect
## -0.18792301 0.31717351
## negative.affect confidence.in.govt
## 0.08341066 -0.13307210
## gini.house.hold.income
## -0.13385792
(Intercept) log.gdp.per.cap social.support freedom.to.make.life.choices
0.00000000 0.41760768 0.16409732 0.10531958
perception.of.corruption positive.affect negative.affect confidence.in.govt
-0.18943523 0.31779551 0.08004477 -0.13943904
gini.house.hold.income
-0.13148714
결론 : life.ladder (행복지수)에 log.gdp.per.cap (beta = 0.5726012)가 가장 영향력이 크고, 그 다음으로 positive.affect(beta=0.31779551), perception.of.corruption (beta = -0.189435), social support (beta=0.16409), confidnece in government (beta = -0.13943904), gini.house.hold.income (beta = -0.13148), freedom.to.make.life.choices (beta=0.1053), negative.affect (beta=0.0800), generosity (beta=0.0615), 순서로 나타났다.
회귀분석을 위한 가정(Assumption)들은 만족하는지를 확인
종속변수가 정규분포를 따른다면, 잔차(residual value) 또한 정규분포를 따르며, 평균은 0 이다. i. Normal Q-Q plot : x축은 잔차의 사분위수(Quartiles), y축은 잔차의 표준화. 정규성 가정을 만족한다면 Normal Q-Q plot는 직선에 가까운 그래프가 되어야 한다. 직선을 벗어나는 점들이 많으면 정규성 가정이 깨질 가능성이 높다.
qqnorm(lm.train.step2$residuals)
qqline(lm.train.step2$residuals)
# Testing for autocorrelation of errors
olsrr::ols_test_correlation(lm.train.step2) # 0.995226
## [1] 0.996192
# method 1: Testing if residuals are normally distributed
shapiro.test(lm.train.step2$residuals)
##
## Shapiro-Wilk normality test
##
## data: lm.train.step2$residuals
## W = 0.99148, p-value = 0.4389
# method 2: Testing if residuals are normally distributed
# olsrr::ols_test_normality(lm.train.step2)
qqp(lm.train.step2, data = df.train.mean2)
## [1] 60 137
x축 : 독립변수 또는 종속변수의 추정치 y축 : 잔차 또는 표준화 잔차 분산이 일정하다는 가정을 만족한다면 잔차의 값이 0을 기준으로 위/아래로 랜덤random)하게 분포되어 있어야 한다. 일정한 패턴을 가지면 등분산성은 깨진다.
p-value가 0.7112, 즉 0.05보다 크다. 귀무가설: 잔차의 값이 0을 기준으로 위/아래로 랜덤random)하게 분포되어 있다. 등분산성 조건 만족.
# Check for homoscedasticity
car::ncvTest(lm.train.step2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.204514 Df = 1 p = 0.6511016
spreadLevelPlot(lm.train.step2)
##
## Suggested power transformation: 1.349126
# Conclusion : No change required
종속변수는 서로 독립적이어야 한다. 관찰값들 사이에 상관관계가 있을 경우, 오차항들이 서로 독립이라는 조건을 검토해 보아야 한다. 만일 오차항들이 서로 독립이라면, 잔차들은 무작위로 흩어져 있을 것이고, 그렇지 않다면 무작위로 흩어져 있지 않아 오차항들 사이에 상관관계가 있다고 할 수 있을 것이다. i. 더빈-왓슨 통계량(Durbin-Watson Statistics) 오차항의 독립성을 평가하는 한 측도로 더빈-왓슨(Durbin-Watson) 통계량이 있다. DW 통계량은 잔차들의 상관계수를 측정함 DW 통계량이 2에 가까우면 인접 오차항들 사이에 상관관계가 없는 것을 의미하며, DW 통계량이 4에 가까우면 음의 상관관계가 있고 DW 통계량이 0에 가까우면 양의 상관관계가 있는 것으로 평가한다. 귀무가설 : 각 오차들은 서로 독립이다. 대립가설 : 각 오차들은 서로 독립이 아니다.
p-value가 0.4426, 즉 0.05보다 크다. 귀무가설: 각 오차들은 서로 독립이다. 독립성 조건 만족.
lmtest::dwtest(lm.train.step2)
##
## Durbin-Watson test
##
## data: lm.train.step2
## DW = 1.991, p-value = 0.4785
## alternative hypothesis: true autocorrelation is greater than 0
종속변수와 독립변수가 선형관계에 있다면 잔차와 독립변수 사이에 어떤 체계적인 관계가 있으면 안 된다.
par(mfrow = c(2, 2))
plot(lm.train.step2)
Estimate
log.gdp.per.cap 0.37435
social.support 1.40026
freedom.to.make.life.choices 0.98169
perception.of.corruption -1.20868
positive.affect 3.44207
negative.affect 1.17269
confidence.in.govt -0.99422
gini.house.hold.income -1.54129
life.ladder = (0.37435 log.gdp.per.cap) + (0.98169 freedom.to.make.life.choices) + (0.43866 generosity) + (-1.20868 perception.of.corruption) + (-0.99422 confidence.in.govt) + (1.40026 social.support) + (3.44207 positive.affect) + (-1.54129 gini.house.hold.income) + (1.17269 negative.affect) + (1.40026 social.support)
NA값 제외 후 총 Data Point : 148 예측안에 들어가는 결과 : 148 결론 : 모든 Test Data는 Train Data로 만들어진 Final Linear Model 예측안에 들어간다
# matching train colnames to test colnames
train.colnames <- colnames(df.train.mean2)
test.colnames <- colnames(df.test.mean)
predict.colnames <- test.colnames[train.colnames %in% test.colnames]
pred_probs <- predict(lm.train.step2, newdata = df.test.mean[, predict.colnames], interval = "predict", type = "response")
dt.pred_prob <- as.data.table(pred_probs)
dt.pred_prob$result <- ifelse((dt.pred_prob$lwr <= dt.pred_prob$fit) & (dt.pred_prob$upr >= dt.pred_prob$fit), 1, 0)
table(dt.pred_prob$result)
##
## 1
## 148
# MSE of our model
mean(lm.train.step2$residuals^2)
## [1] 0.1939984